ABSTRACT


For  most  proteins  in  the  genome  databases,  function  is  predicted  via  sequence  comparison.  In  spite 
of  the  popularity  of  this  approach,  the  extent  to  which  it  can  be  reliably  applied  is  unknown.  We 
address  this  issue  by  systematically  investigating  the  relationship  between  protein  function  and 
structure.  We  focus  initially  on  enzymes  classified  by  the  Enzyme  Commission  (EC)  and  relate 
these  to  structurally  classified  proteins  in  the  SCOP  database.  We  find  that  the  major  SCOP  fold 
classes  have  different  propensities  to  carry  out  certain  broad  categories  of  functions.  For  instance, 
alpha/beta  folds  are  disproportionately  associated  with  enzymes,  especially  transferases  and 
hydrolases,  and  all-alpha  and  small  folds  with  non-enzymes,  while  alpha+beta  folds  have  an  equal 
tendency  either  way.  These  observations  for  the  database  overall  are  largely  true  for  specific 
genomes.  We  focus,  in  particular,  on  yeast,  analyzing  it  with  many  classifications  in  addition  to 
SCOP  and  EC  (i.e.  COGs,  CATH,  MIPS),  and  find  clear  tendencies  for  fold-function  association, 
across  a  broad  spectrum  of  functions.  Analysis  with  the  COGs  scheme  also  suggests  that  the 
functions  of  the  most  ancient  proteins  are  more  evenly  distributed  among  different  structural  classes 
than those of more modern ones. For the database overall, we identify both the most versatile functions,
i.e. those that are associated with the most folds, and the most versatile folds, associated with the most
functions. The two most versatile enzymatic functions (hydro-lyases and O-glycosyl glucosidases)
are  associated  with  7  folds  each.  The  five  most  versatile  folds  (TIM-barrel,  Rossmann,  ferredoxin, 
alpha-beta  hydrolase,  and  P-loop  NTP  hydrolase)  are  all  mixed  alpha-beta  structures.  They  stand 
out  as  generic  scaffolds,  accommodating  from  6  to  as  many  as  16  functions  (for  the  exceptional 
TIM-barrel).  At  the  conclusion  of  our  analysis  we  are  able  to  construct  a  graph  giving  the  chance 
that  a  functional  annotation  can  be  reliably  transferred  at  different  degrees  of  sequence  and 
structural  similarity.  Supplemental  information  is  available  from 
http://bioinfo.mbb.yale.edu/genome/foldfunc.
INTRODUCTION


The  Problem  of  Determining  Function  from  Sequence 

An  ultimate  goal  of  genome  analysis  is  to  determine  the  biological  function  of  all  the  gene  products 
in  a  genome.  However,  the  function  of  only  a  minor  fraction  of  proteins  has  been  studied 
experimentally,  and,  typically,  prediction  of  function  is  based  on  sequence  similarity  with  proteins 
of  known  function.  That  is,  functional  annotation  is  transferred  based  on  similarity.  Unfortunately, 
the relationship between sequence similarity and functional similarity is not straightforward. This
has  been  commented  on  in  numerous  reviews  (Bork  &  Koonin,  1998;  Karp,  1998).  Karp  (1998),  in 
particular, has noted that transferring incorrect functional information threatens to progressively
corrupt genome databases, as incorrect annotations accumulate and are themselves used as the
basis for further annotations.

It  is  known  that  sequence  similarity  does  confer  structural  similarity.  Moreover,  there  is  a  well- 
established  quantitative  relationship  between  the  extent  of  similarity  in  sequence  and  that  in 
structure.  First  investigated  by  Chothia  &  Lesk,  the  similarity  between  the  structures  of  two  proteins 
(in  terms  of  RMS)  appears  to  be  a  monotonic  function  of  their  sequence  similarity  (Chothia  &  Lesk, 
1986).  This  fact  is  often  exploited  when  two  sequences  are  declared  related,  based  on  a  database 
search by programs such as BLAST or FastA (Altschul et al., 1997; Pearson, 1996). Often, the only
common  element  in  two  distantly  related  protein  sequences  is  their  underlying  structures,  or  folds. 

Transitivity  requires  that  the  well-established  relationship  between  sequence  and  structure  and  the 
more  indefinite  one  between  sequence  and  function  imply  an  indefinite  relationship  between 
structure  and  function.  Several  recent  papers  have  highlighted  this,  analyzing  individual  protein 
superfamilies  with  a  single  fold  but  diverse  functions.  Examples  include  the  aldo-keto  reductases,  a 
large  hydrolase  superfamily,  and  the  thiol  protein  esterases.  The  latter  include  the  eye-lens  and 
corneal  crystallins,  a  remarkable  example  of  functional  divergence  (Bork  &  Eisenberg,  1998;  Bork 
et  al.,  1994;  Cooper  et  al.,  1993;  Koonin  &  Tatusov,  1994;  Seery  et  al.,  1998). 

There  are  also  many  classic  examples  of  the  converse:  the  same  function  achieved  by  proteins  with 
completely  different  folds.  For  instance,  even  though  mammalian  chymotrypsin  and  bacterial 
subtilisin  have  different  folds,  they  both  function  as  serine  proteases  and  have  the  same  Ser-Asp- 
His  catalytic  triad.  Other  examples  include  sugar  kinases,  anti-freeze  glycoproteins,  and  lysyl-tRNA 
synthetases  (Bork  et  al.,  1993;  Chen  et  al.,  1997;  Doolittle,  1994;  Ibba  et  al.,  1997a;  Ibba  et  al., 
1997b). 

Figure  1  shows  well-known  examples  of  each  of  these  two  basic  situations:  same  fold  but  different 
function  (divergent  evolution)  and  same  function  but  different  fold  (convergent  evolution). 

Protein  Classification  Systems 

The  rapid  growth  in  the  number  of  protein  sequences  and  3D  structures  has  made  it  practical  and 
advantageous  to  classify  proteins  into  families  and  more  elaborate  hierarchical  systems.  Proteins  are 
grouped together on the basis of structural similarities in the FSSP (Holm & Sander, 1998), CATH
(Orengo  et  al.,  1997),  and  SCOP  databases  (Murzin  et  al.,  1995).  SCOP  is  based  on  the  judgments 
of a human expert; FSSP, on automatic methods; and CATH, on a mixture of both. Other databases
collect proteins on the basis of sequence similarities to one another, e.g. PROSITE, SBASE,
Pfam, BLOCKS, PRINTS and ProDom (Attwood et al., 1998; Bairoch et al., 1997; Corpet et al.,
1998;  Fabian  et  al.,  1997;  Henikoff  et  al.,  1998;  Sonnhammer  et  al.,  1997).  Several  collections 
contain  information  about  proteins  from  a  functional  point  of  view.  Some  of  these  focus  on 
particular  organisms  -  e.g.  the  MIPS  functional  catalogue  and  YPD  for  yeast  (Mewes  et  al.,  1997; 
Hodges  et  al.,  1998)  and  EcoCyc  and  GenProtEC  for  E.  coli  (Karp  et  al.,  1998;  Riley,  1997). 

Others focus on particular functional aspects in multiple organisms, e.g. the WIT and Kegg
databases  which  focus  on  metabolism  and  pathways  (Selkov  et  al.,  1997;  Ogata  et  al.,  1999),  the 
ENZYME  database  which  focuses  obviously  enough  on  enzymes  (Bairoch,  1996),  and  the  COGs 
system  which  focuses  on  proteins  conserved  over  phylogenetically  distinct  species  (Tatusov  et  al., 
1997).  The  ENZYME  database,  in  particular,  contains  all  the  enzyme  reactions  that  have  an  “EC 
number”  assigned  in  accordance  with  the  International  Nomenclature  Committee  and  is  cross- 
referenced  with  Swissprot  (Bairoch,  1996;  Bairoch  &  Apweiler,  1998;  Barrett,  1997). 

Our  approach:  Systematic  Comparison  of  Proteins  Classified  by  Structure  with  those 
Classified  by  Function 

One of the most valuable operations one can perform on these individual classification systems is to
cross-reference  and  cross-tabulate  them,  seeing  how  they  overlap.  We  perform  such  an  analysis  here 
by  systematically  interrelating  the  SCOP,  Swissprot  and  ENZYME  databases  (Bairoch,  1996; 
Bairoch  &  Apweiler,  1998;  Murzin  et  al.,  1995).  For  yeast  we  also  have  used  the  MIPS  yeast 
functional  catalogue,  CATH,  and  COGs  in  our  analysis.  This  enables  us  to  investigate  the 
relationship  between  protein  function  and  structure  in  a  comprehensive  statistical  fashion.  In 
particular,  we  investigated  the  functional  aspects  of  both  divergent  and  convergent  evolution, 
exploring  cases  where  a  structure  gains  a  dramatically  different  biochemical  function  and  finding 
instances  of  similar  enzymatic  functions  performed  by  unrelated  structures. 

We  concentrated  on  single-domain  Swissprot  proteins  with  significant  sequence  similarity  to  one  of 
the  SCOP  structural  domains.  Since  most  of  these  proteins  have  a  single  assigned  function, 
comparing  them  to  individual  structural  domains,  which  can  have  only  one  assigned  fold,  allowed 
us  to  establish  a  one-to-one  relationship  between  structure  and  function. 

Recent  Related  Work 

This work follows up on several recent papers on the relationship between protein structure and
function.  In  particular,  Martin  et  al.  studied  the  relationship  between  enzyme  function  and  the 
CATH  fold  classification  (Martin  et  al.,  1998).  They  concluded  that  functional  class  (expressed  by 
top-level  EC  numbers)  is  not  related  to  fold,  since  a  few  specific  residues,  not  the  whole  fold, 
determine  enzyme  function.  Russell  also  focused  on  specific  sidechain  patterns,  arguing  that  these 
could  be  used  to  predict  protein  function  (Russell,  1998).  In  a  similar  fashion,  Russell  et  al. 
identified  structurally  similar  “supersites”  in  superfolds  (Russell  et  al.,  1998).  They  estimated  that 
the  proportion  of  homologues  with  different  binding  sites  —  and  therefore  with  different  functions  — 
is  around  10%.  In  a  novel  approach,  using  machine  learning  techniques,  des  Jardins  et  al.  predict 
purely  from  the  sequence  whether  a  given  protein  is  an  enzyme  and  also  the  enzyme  class  to  which 
it  belongs  (des  Jardins  et  al.,  1997). 
Our work is also motivated by recent work looking at whether or not organisms are characterized by
unique  protein  folds  (Frishman  &  Mewes,  1997;  Gerstein,  1997;  Gerstein  &  Hegyi,  1998;  Gerstein 
&  Levitt,  1997;  Gerstein,  1998a,b).  If  function  is  closely  associated  with  fold  (in  a  one-to-one 
sense),  one  would  think  that  when  a  new  function  arose  in  evolution,  nature  would  have  to  invent  a 
new  fold.  Conversely,  if  fold  and  function  are  only  weakly  coupled,  one  would  expect  to  see  a  more 
uniform  distribution  of  folds  amongst  organisms  and  a  high  incidence  of  convergent  evolution.  In 
fact,  a  recent  paper  on  microbial  genome  analysis  claims  that  functional  convergence  is  quite 
common  (Koonin  &  Galperin,  1997).  Another  related  paper  systematically  searched  Swissprot  for 
all  such  cases  of  what  is  termed  “analogous”  enzymes  (Galperin  et  al.,  1998). 

Our  work  is  also  motivated  by  the  recent  work  on  protein  design  and  engineering,  which  aims  to 
rationally  change  a  protein  function  —  for  instance,  to  engineer  a  reporter  function  into  a  binding 
protein  (Hellinga,  1997;  Hellinga,  1998;  Marvin  et  al.,  1997). 


RESULTS 

Overview  of  the  8937  Single-domain  Matches 

Our  basic  results  were  based  on  simple  sequence  comparisons  between  Swissprot  and  SCOP,  the 
SCOP  domain  sequences  being  used  as  queries  against  Swissprot.  We  focused  on  'mono-functional' 
single-domain matches in Swissprot, i.e. those single-domain proteins with only one annotated
function.  The  detailed  criteria  used  in  the  database  searches  are  summarized  in  the  Methods. 

Overall,  a  little  more  than  a  quarter  of  the  proteins  in  Swissprot  are  enzymes,  a  similar  fraction  are 
of  known  structure,  and  about  an  eighth  are  both.  (More  precisely,  of  the  69113  analyzed  proteins  in 
Swissprot,  19995  are  enzymes,  18317  are  structural  homologues,  and  8205  are  both.)  About  half  of 
the  fraction  of  Swissprot  that  matched  known  structures  were  “single  domain”  and  about  a  third  of 
these were enzymes (8937 and 3359, respectively, of 18317). We focus on these 8937
single-domain matches here. Note that these numbers also show that the known structures are
significantly  biased  towards  enzymes:  45%  (8205  out  of  18317)  of  all  the  structural  homologues  are 
enzymes  versus  29%  (19995  out  of  69113)  for  all  of  Swissprot. 
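
The proportions quoted in this paragraph follow directly from the six counts given in the text. The short check below simply recomputes them; the variable and function names are ours, introduced only for illustration.

```python
# Counts quoted in the text for the Swissprot proteins analyzed.
total = 69113                 # all analyzed Swissprot proteins
enzymes = 19995               # proteins annotated as enzymes
structural = 18317            # proteins homologous to a known structure
both = 8205                   # enzymes that are also structural homologues
single_domain = 8937          # single-domain structural matches
single_domain_enzymes = 3359  # single-domain matches that are enzymes


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


fractions = {
    "enzymes among all of Swissprot": pct(enzymes, total),
    "structural homologues among all of Swissprot": pct(structural, total),
    "enzymes with a structural homologue": pct(both, total),
    "enzymes among structural homologues": pct(both, structural),
    "single-domain among structural homologues": pct(single_domain, structural),
    "enzymes among single-domain matches": pct(single_domain_enzymes, single_domain),
}

for label, value in fractions.items():
    print(f"{label}: {value}%")
```

The fourth and first entries reproduce the 45% versus 29% comparison that indicates the bias of known structures towards enzymes.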

331  Observed  Fold-function  Combinations 

Figure  2  gives  an  overview  of  how  the  matches  are  distributed  amongst  specific  functions  and  folds. 
The  single-domain  matches  include  229  of  the  361  folds  in  SCOP  1.35  and  91  of  the  207  3- 
component  enzyme  categories  in  the  ENZYME  database  (Bairoch,  1996).  Each  match  combines  a 
SCOP  fold  number  on  the  structural  side  (columns  in  Figure  2)  and  a  3-component  EC  category  on 
the  functional  side  (rows),  with  all  the  non-enzymatic  functions  grouped  together  into  a  single 
category  with  the  artificial  “EC  number”  of  0.0.0  (shown  in  the  first  row  in  Figure  2).  This  results 
in  a  table  where  each  cell  represents  a  potential  fold-function  combination.  The  table  contains  a 
maximum  of  21068  (=229  x  92)  possible  fold-function  combinations  (and  a  minimum  of  229 
combinations,  assuming  only  one  function  for  every  fold).  We  actually  observe  331  of  these 
combinations  (1.6%,  shown  by  the  filled-in  cells). 
Overall, more than half of the functions are associated with at least two different folds, while less
than  half  of  the  folds  with  enzymatic  activity  have  at  least  two  functions  (51  out  of  91  and  53  out  of 
128,  respectively). 
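
The bookkeeping behind such a fold-function table can be sketched as follows. This is a minimal illustration, not the procedure actually used for Figure 2: the fold labels and the toy list of matches are hypothetical, and only the final two lines use numbers taken from the text.

```python
from collections import defaultdict

# Hypothetical toy matches: (fold label, 3-component EC category).
# "0.0.0" is the artificial category for non-enzymes, as in the text.
matches = [
    ("fold_A", "3.2.1"),
    ("fold_A", "4.2.1"),
    ("fold_A", "0.0.0"),
    ("fold_B", "1.1.1"),
    ("fold_B", "4.2.1"),
    ("fold_C", "0.0.0"),
    ("fold_C", "0.0.0"),  # repeated matches collapse into one combination
]

# Each distinct (fold, function) pair is one filled-in cell of the table.
combinations = set(matches)

folds_per_function = defaultdict(set)
functions_per_fold = defaultdict(set)
for fold, function in combinations:
    folds_per_function[function].add(fold)
    functions_per_fold[fold].add(function)

# Table size and the fraction of cells that are actually observed.
possible = len(functions_per_fold) * len(folds_per_function)
density = len(combinations) / possible

# Functions found on at least two folds, and folds with at least two functions.
multi_fold_functions = sorted(f for f, s in folds_per_function.items() if len(s) >= 2)
multi_function_folds = sorted(f for f, s in functions_per_fold.items() if len(s) >= 2)

print(len(combinations), possible, density)
print(multi_fold_functions, multi_function_folds)

# The same arithmetic with the totals quoted in the text.
paper_possible = 229 * 92             # potential combinations
paper_density = 331 / paper_possible  # observed fraction of the table
```

Counting distinct pairs rather than individual matches is what makes the tally insensitive to how many sequences happen to share a given fold and function.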

Summarizing  the  Fold-function  Combinations  by  42  Broad  Structure-function  Classes 

As  listed  in  Table  1,  folds  can  be  subdivided  in  6  broad  fold  classes  (e.g.  all-alpha,  all-beta, 
alpha/beta,  etc.).  Likewise,  functions  can  be  broken  into  7  main  classes  —  non-enzymes  plus  six 
enzyme  classes,  e.g.  oxidoreductase,  transferase,  etc.  This  gives  rise  to  42  (6x7)  structure-function 
classes.  The  way  the  21068  potential  fold-function  combinations  are  apportioned  amongst  the  42 
classes  is  shown  in  Table  2A. 

Table  2B  shows  the  way  the  331  observed  combinations  were  actually  distributed  amongst  the  42 
classes.  Comparing  the  number  of  possible  combinations  with  that  observed  shows  that  the  most 
densely  populated  region  of  the  chart  is  the  transferase,  hydrolase  and  lyase  functions  in 
combination  with  the  alpha/beta  fold  class.  This  notion  is  in  accordance  with  the  general  view  that 
the  most  ‘popular’  structures  among  enzymes  fall  into  the  alpha/beta  class.  In  contrast,  matches 
between  small  folds  and  enzymes  are  almost  completely  missing,  except  for  five  folds  in  the 
oxidoreductase  category.  There  are  also  no  all-alpha  ligases  and  only  one  all-alpha  isomerase. 

Tables 2C and 2D break down the 331 fold-function combinations in Table 2B into either just a
number  of  folds  or  just  a  number  of  functions.  That  is,  Table  2C  lists  the  number  of  different  folds 
associated  with  each  of  the  42  structure-function  classes  (corresponding  to  the  non-zero  columns  in 
the  relevant  class  in  Figure  2).  Table  2D  does  the  same  thing  for  functions  (non-zero  rows  in  Figure 
2). Comparing these tables back to the total number of combinations (Table 2A) reveals some
interesting  findings,  keeping  in  mind  that  more  functions  than  folds  reveals  probable  divergence 
and  that  more  folds  than  functions  reveals  probable  convergence.  For  instance,  the  alpha/beta  and 
alpha+beta  fold  classes  contain  similar  numbers  of  folds,  but  the  alpha/beta  class  has  relatively 
more  functions,  perhaps  reflecting  a  greater  divergence.  (Specifically,  the  alpha/beta  class  has  73 
folds  and  56  functions,  while  the  alpha+beta  class  has  67  folds  but  only  35  functions.) 

Table  2E  shows  the  number  of  matching  Swissprot  sequences  (from  the  total  of  69113)  for  each  of 
the 42 structure-function classes. The most highly populated categories are the all-alpha
non-enzymes, where 683 of the 1940 matches come from globins, and the all-beta non-enzymes, where
361  of  the  1159  Swissprot  sequences  have  matches  with  the  immunoglobulin  fold.  These  numbers 
are,  obviously,  affected  by  the  biases  in  Swissprot.  On  the  other  hand,  if  we  compare  the  total 
matches  in  Table  2E  with  the  total  combinations  in  Table  2B  it  is  clear  that  the  numbers  do  not 
directly  correlate.  For  instance,  fewer  hydrolases  in  Swissprot  have  matches  with  alpha/beta  folds 
than  with  alpha+beta  folds  (295  vs.  452),  but  the  number  of  different  combinations  in  the  first  case 
is  30,  as  opposed  to  only  18  in  the  second  case.  This  suggests  that  our  approach  of  counting 
combinations  may  not  be  as  affected  by  the  biases  in  the  databanks  as  simply  counting  matches. 

Tables 2F and 2G give some rough indication of the statistical significance of the differences in the
observed  distribution  of  combinations.  In  Table  2F,  using  chi-squared  statistics,  we  calculate  for 
each  individual  structure  class  the  chance  that  we  could  get  the  observed  distribution  of  fold- 
function  combinations  over  various  functional  classes  if  fold  was  not  related  to  function.  Then  in 
Table 2G, we reverse the roles of fold and function, and calculate the statistics for each functional
class. 

Enzyme  versus  Non-enzyme  Folds 

On  the  coarsest  level,  function  can  be  divided  amongst  enzymes  and  non-enzymes.  Of  the  229  folds 
present in Figure 2, 93 are associated only with enzymes and 101 are associated only with
non-enzymes. The remaining 35 folds were associated with both enzymatic and non-enzymatic activity.
Finally,  of  the  93  purely  enzymatic  folds,  18  have  multiple  enzymatic  functions. 

Figure  3A  shows  a  graphical  view  of  the  distribution  of  the  different  fold  classes  among  these 
broadest  functional  categories.  The  distribution  is  far  from  uniform.  The  all-alpha  fold  class  has  30 
non-enzymatic  representatives,  but  only  12  purely  enzymatic  folds  and  4  folds  with  “mixed”  (both 
types  of)  functions.  This  implies  that  a  protein  with  an  all-alpha  fold  has  a  priori  roughly  twice  the 
chance  of  having  a  non-enzymatic  function  over  an  enzymatic  one.  The  all-beta  fold  class  has  6 
enzymatic,  17  non-enzymatic  and  13  “mixed”  folds.  In  the  alpha/beta  class,  34  folds  are  associated 
only  with  enzymes  and  5  folds  only  with  non-enzymes,  whereas  in  the  alpha+beta  class  this  ratio  is 
more  balanced  —  28  'purely'  enzymatic  folds  versus  22  purely  non-enzymatic  ones. 
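
One way to make "far from uniform" concrete is a chi-squared test of independence on the fold counts quoted in this paragraph. The sketch below is an illustration under our own assumptions, not necessarily the calculation behind Tables 2F and 2G: it uses only the purely enzymatic and purely non-enzymatic counts for the four fold classes given in the text, and omits the "mixed" folds because they are not quoted for every class. The function name is hypothetical.

```python
# Fold counts quoted in the text for Figure 3A:
# (purely enzymatic folds, purely non-enzymatic folds) per fold class.
observed = {
    "all-alpha": (12, 30),
    "all-beta": (6, 17),
    "alpha/beta": (34, 5),
    "alpha+beta": (28, 22),
}


def chi_squared_independence(table):
    """Return (statistic, degrees of freedom) for a rows-by-columns count table."""
    rows = list(table.values())
    n_cols = len(rows[0])
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(row[j] for row in rows) for j in range(n_cols)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for row, row_total in zip(rows, row_totals):
        for j in range(n_cols):
            # Expected count if fold class and function type were unrelated.
            expected = row_total * col_totals[j] / grand_total
            statistic += (row[j] - expected) ** 2 / expected

    dof = (len(rows) - 1) * (n_cols - 1)
    return statistic, dof


stat, dof = chi_squared_independence(observed)

# Standard chi-squared table value for 3 degrees of freedom at the 5% level.
CRITICAL_5PCT_DF3 = 7.815
print(f"chi-squared = {stat:.2f} with {dof} degrees of freedom")
print("exceeds 5% critical value:", stat > CRITICAL_5PCT_DF3)
```

Most of the statistic comes from the alpha/beta row, which matches the observation that this class is dominated by purely enzymatic folds.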


Restricting  the  Comparison  to  Individual  Genomes 

Figure  3A  applies  to  all  of  Swissprot.  Figures  3B  and  C  show  the  functional  distribution  of  folds 
taking  into  account  the  matches  only  in  two  specific  genomes,  yeast  and  E.  coli.  Only  a  fraction  of 
each  genome  could  be  taken  into  consideration  for  various  reasons  (156  proteins  in  yeast,  244 
proteins in E. coli), mostly due to the great number of enzymes having multiple domains in both
yeast and E. coli. Chi-squared tests show that the fold distribution in yeast does not differ
significantly from that in Swissprot and that the one in E. coli differs only slightly (P<0.25 and
P<0.02, respectively). The main difference between Swissprot and E. coli is the larger fraction of
alpha/beta  enzymatic  folds  in  the  latter  (34/93  versus  26/49).  There  are  also  somewhat  more  non- 
enzymatic  all-alpha  and  small  folds  in  Swissprot  than  in  the  two  genomes.  This  is  principally  due  to 
the  greater  prevalence  of  globins,  myosins,  cytochromes,  toxins,  and  hormones  in  Swissprot  than  in 
yeast  and  E.  coli.  Many  of  these,  of  course,  are  proteins  usually  associated  with  multicellular 
organisms.  We  did  a  preliminary  version  of  the  fold  distribution  for  the  worm  C.  elegans.  As 
expected  this  distribution  turns  out  to  be  similar  to  that  of  Swissprot  (data  not  shown). 

The  Yeast  Genome  Viewed  from  Different  Classification  Schemes 

In  Figure  4  we  focus  on  the  yeast  genome  in  more  detail,  trying  to  see  the  effect  that  different 
classification schemes have on our results. Although the total number of counts for our statistics
decreases, of course, in just using yeast relative to all of Swissprot, yeast provides a good reference
frame  to  compare  a  number  of  classification  schemes  in  as  unbiased  a  fashion  as  possible.  Also, 
yeast  is  one  of  the  most  comprehensively  characterized  organisms,  and  there  are  a  number  of 
functional  classifications  available  exclusively  for  this  organism. 

In  part  A  we  cross-tabulate  the  structure-function  combinations  in  yeast  using  the  SCOP  and  EC 
systems  as  we  have  done  for  all  of  Swissprot  in  Table  2B.  The  yeast  distribution  is  fairly  similar  to 
that of Swissprot with the only major difference being somewhat more alpha/beta transferases and
fewer alpha/beta hydrolases than expected. (A chi-squared test gives P<~0.05 for the two
distributions to differ. If either the transferase or hydrolase difference is removed, P increases to
~20%.)

Part B shows structure-function combinations based on using the CATH structural classification
(Orengo  et  al.,  1997)  instead  of  SCOP.  For  this  sub-figure  we  mapped  the  SCOP  classification  of  a 
yeast  PDB  match  to  its  corresponding  CATH  classification  and  then  cross-tabulated  the  structure- 
function  combinations  in  the  various  classes.  Essentially,  this  subfigure  shows  the  results  of  Martin 
et  al.  (1998)  just  for  yeast. 

In  subfigures  C  and  D,  which  show  a  COGs  versus  SCOP  cross  tabulation,  we  achieve  the  opposite 
of subfigure B. We change the functional classification scheme but keep SCOP for classifying
structures.  As  was  the  case  with  the  enzyme  classification,  but  perhaps  even  more  so,  using  COGs 
to  classify  function  shows  clearly  that  certain  fold  classes  are  associated  with  certain  functions  and 
vice  versa.  Most  notably,  whereas  the  functions  associated  with  metabolism,  which  are  mostly 
enzymes,  are  preferentially  associated  with  the  alpha/beta  fold  class,  those  associated  with  cellular 
processes  (e.g.  secretion)  and  information  processing  (e.g.  transcription),  show  no  such  preference. 
They,  in  fact,  show  a  marked  preference  for  all-alpha  structure.  Small  proteins  are  absent  from  most 
of  the  COGs  classes,  except  one  part  of  information  processing  and  two  in  cellular  processes. 

The  COGs  system  classifies  functions  for  those  proteins  that  have  clear  orthologues  in  different 
species.  Thus,  conclusions  based  on  using  yeast  COGs  should  be  readily  applicable  to  other 
genomes. This point is highlighted in the next sub-figure “4D”, which shows a COGs versus SCOP
classification for only the 110 COGs that are conserved across all the analyzed genomes (8) and all
three kingdoms. Thus, this sub-figure would appear exactly the same for E. coli, M. jannaschii or a
number  of  other  genomes.  It  clearly  shows  how  much  more  common  the  information  processing 
proteins  are  among  the  most  conserved  and  ancient  proteins.  Moreover,  note  how  these  most 
ancient  proteins  appear  to  have  less  of  a  preference  for  a  particular  structural  class  than  the  “more 
modern” metabolic ones. This suggests that large-scale duplication of alpha/beta folds for use in
metabolism is what gave rise to stronger fold-function association in Figure 4C.

Subfigure  E  shows  another  functional  classification  scheme,  the  MIPS  Yeast  functional  catalogue 
(Mewes  et  al.,  1997)  (hereafter  just  referred  to  as  "MIPS").  Unlike  the  COGs  scheme,  this  has  the 
advantage  of  being  applicable  to  every  yeast  ORF.  However,  it  has  many  more  categories  and  about 
a  third  of  the  yeast  ORFs  are  classified  into  multiple  categories  (sometimes  five  or  more),  making 
interpretation  of  the  results  a  bit  more  ambiguous. 

The  Most  Versatile  Folds  and  the  Most  Versatile  Functions 

Returning  to  considerations  of  all  of  Swissprot,  Figure  5  lists  the  16  most  versatile  folds.  The  top  5 
are  the  TIM-barrel,  the  alpha-beta  hydrolase  fold,  the  Rossmann  fold,  the  P-loop  containing  NTP 
hydrolase  fold,  and  the  ferredoxin  fold.  Four  of  these  are  alpha/beta  folds  and  one  is  alpha+beta.  All 
five  have  non-enzymatic  functions  as  well  as  5  to  15  enzymatic  ones.  The  most  versatile  folds 
include,  in  addition,  four  all-beta  and  two  all-alpha  folds. 
Figure 6 lists the 18 functions that have the most different folds associated with them, each having
at least 3 associated folds. The most versatile functions are those of glycosidases and hydro-lyases
(3.2.1 and 4.2.1), which are associated with seven different fold types each, recruited from at
least three different fold classes. The next two most versatile functions, the phosphoric monoester
hydrolases and the linear amide hydrolases (3.1.3 and 3.5.1), are associated with six different
fold  types  each.  Most  of  the  versatile  functions  are  associated  with  folds  in  completely  different 
fold  classes.  This  suggests  that  these  enzymes  developed  independently,  providing  many  examples 
of  convergent  evolution.  In  contrast,  only  three  functions,  all  oxidoreductases,  are  associated  with 
folds  in  a  single  class  (last  three  rows  in  Figure  6).  These  folds  are  all  alpha/beta,  namely  the  TIM- 
barrel,  Rossmann,  and  Flavodoxin  folds. 

Specific  Functional  Convergences  involving  Different  Folds 

Even  on  the  level  of  specificity  of  4-component  EC-numbers,  several  enzymatic  functions  are 
performed  by  unrelated  structures.  Figure  1  shows  a  dramatic  example,  two  different  carbonic 
anhydrases with the same EC number 4.2.1.1 but with clearly different structures (Kisker et al.,
1996).  Table  3  shows  further  examples  in  a  more  systematic  fashion.  Most  of  these  occur  in 
different  evolutionary  lineages.  For  instance,  the  all-alpha  Vanadium  chloroperoxidase  occurs  only 
in  fungi,  while  the  alpha/beta  non-heme  chloroperoxidase  occurs  only  in  prokaryotes.  Another 
example  is  beta-glucanase.  It  has  as  many  as  three  different  structural  representations,  from  three 
different  fold  classes.  While  it  has  an  all-beta  structure  in  B.  subtilis,  it  has  an  all-alpha  variant  in  B. 
circulans,  and  an  alpha/beta  structure  in  tobacco. 

Specific  Functional  Divergences  on  Same  Fold 

Quite  a  number  of  SCOP  domains  each  have  sequence  similarity  with  Swissprot  proteins  of 
different  function.  We  separated  these  into  cases  in  which  the  structural  domain  has  similarity  to 
proteins  with  different  enzymatic  functions  only  and  those  in  which  a  domain  shows  homology  to 
both enzymes and non-enzymes (Table 4A and 4B, respectively). Table 4B includes the well-known
lactalbumin-lysozyme C similarity and the well documented case of homology between an
eye-lens structural protein and an enzyme (crystallin and glutathione S-transferase) (Cooper et al.,
1993;  Qasba  &  Kumar,  1997).  It  includes  several  other  notable  divergences,  such  as  the  one 
between  lysophospholipidase  and  galectin,  and  the  one  between  an  elastase  and  an  antimicrobial 
protein  (Morgan  et  al.,  1991).  Remarkably,  of  the  seven  domains  in  this  table,  three  belong  to  the 
all-beta  class. 

“Multifunctionality”  versus  e-value 

Figure 7 shows how the number of “multifunctional” domains, i.e. domains with sequence
similarity to proteins with different functions, varies as a function of the stringency of the match
score  threshold.  We  used  a  minimal  version  of  SCOP  in  which  the  structures  in  PDB  were  clustered 
into  990  representative  domains  (see  description  in  caption  to  Figure  6).  The  figure  shows  how  the 
percentage  of  domains  that  have  sequence  similarity  to  proteins  with  different  functions  (in  terms  of 
three-component EC numbers) varies with sequence similarity. This decreases approximately
monotonically  as  a  function  of  the  exponent  of  the  e-value  threshold.  Interestingly,  there  is  a 
breaking point around log(e-value) = -5, as the sharply decreasing number of functions slows down
and  the  matches  reach  the  level  of  biological  significance. 
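
The quantity plotted in Figure 7 can be sketched as follows. This is a minimal illustration under stated assumptions: the hit list, domain identifiers and function name are hypothetical, and the denominator here is the set of domains that still have at least one hit at the cutoff, which may differ from the normalization used for the actual figure.

```python
import math
from collections import defaultdict

# Hypothetical hits: (representative domain id,
#                     3-component EC category of the matched protein,
#                     e-value of the match).
hits = [
    ("d1", "3.2.1", 1e-40),
    ("d1", "3.2.1", 1e-12),
    ("d1", "4.2.1", 1e-3),
    ("d2", "1.1.1", 1e-30),
    ("d2", "2.7.1", 1e-8),
    ("d3", "3.1.3", 1e-6),
    ("d3", "0.0.0", 1e-2),
]


def multifunctional_fraction(hits, log10_threshold):
    """Fraction of matched domains whose hits at this cutoff span more than one function."""
    functions = defaultdict(set)
    for domain, function, evalue in hits:
        # Keep only matches at least as significant as the threshold.
        if math.log10(evalue) <= log10_threshold:
            functions[domain].add(function)
    if not functions:
        return 0.0
    multi = sum(1 for found in functions.values() if len(found) > 1)
    return multi / len(functions)


# Scan from a permissive cutoff to a stringent one.
thresholds = (-1, -5, -10, -20)
curve = {t: multifunctional_fraction(hits, t) for t in thresholds}

for t in thresholds:
    print(f"log10(e-value) <= {t}: {curve[t]:.2f}")
```

With these toy hits the fraction falls as the cutoff becomes more stringent, which is the qualitative behaviour described for the real data.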

Our  graph  can  be  loosely  compared  with  the  classic  graph  of  Chothia  and  Lesk  showing  the  relation 
of  similarity  in  structure  to  that  in  sequence  (Chothia  &  Lesk,  1986).  It  roughly  shows  the  chance  of 
functional  similarity  (or  more  precisely  the  chance  of  functional  difference)  with  a  given  level  of 
sequence similarity between an enzyme and a protein of unknown function. For example, with an
e-value of 10^-10, there is only an ~5% chance that an unknown protein homologous to a certain
enzyme  has  in  fact  a  different  function.  Moreover,  our  graph  is  in  excellent  agreement  with  the 
findings  of  Russell  et  al.  who  also  found  that  the  proportion  of  homologues  with  different  functions 
is  around  10%  (Russell  et  al.,  1998).  This  shows  that  there  is  a  low  chance  that  a  single-domain 
protein, highly homologous to a known enzyme, has a different function.

DISCUSSION AND CONCLUSIONS


Overview 

We  have  investigated  the  relationship  between  the  structure  and  function  of  proteins  by  comparing 
functionally  characterized  enzymes  in  Swissprot  with  structurally  characterized  domains  in  SCOP. 

It  is  a  timely  subject,  as  the  number  of  three-dimensional  protein  structures  is  increasing  rapidly  and 
the  recent  completion  of  several  microbial  genomes  highlights  the  need  for  functional 
characterization  of  the  gene  products  and  identification  of  enzymes  participating  in  metabolic 
pathways (Koonin et al., 1998).

We  tried  to  be  as  objective  and  as  unbiased  as  possible,  taking  only  enzymes  with  a  single  assigned 
function  and  only  single-domain  matches.  We  ignored  Swissprot  proteins  with  dubious  or  unknown 
function,  or  with  incomplete  sequence.  Given  these  criteria,  several  tendencies  are  clear.  The 
alpha/beta  folds  tend  to  be  enzymes.  The  all-alpha  folds  tend  to  be  non-enzymes  and  the  all-beta 
and  alpha+beta  folds  tend  to  have  a  more  even  distribution  between  enzymes  and  non-enzymes. 

Our  analysis  of  proteins  from  yeast  and  E.  coli  has  shown  that  the  functional  distribution  of  folds 
does not differ greatly from the whole of Swissprot. E. coli, however, appears to have somewhat
more alpha/beta enzymes and fewer non-enzymes.

Functional  Assignment  Complexities 

We  identified  four  specific  complexities  in  our  functional  assignment  worth  mentioning: 

(1) There is not always a one-to-one relationship between gene, protein, and reaction (Riley, 1998).
An enzyme can have two functions, or two polypeptides from two different genes can oligomerize to
perform a single function. It might be that some of the fold-function combinations in Figure 2
occur  together  in  multi-domain  proteins  (which  otherwise  were  not  the  subject  of  this  survey).  An 
exhaustive  screening  revealed  that  only  four  pairs  of  folds  in  Figure  2  were  present  concurrently  in 
multi-domain  proteins.  Each  of  these  reduced  by  one  the  number  of  independent  fold-function 
combinations.  (The  four  pairs  were  as  follows,  with  one  representative  Swissprot  protein  in  each 
category,  EC  numbers  in  brackets,  and  then  SCOP  fold  numbers:  PTAA_ECOLI  [2.7.1]  has  4.049 
and  2.055  folds,  TRP_COPCI  [4.2.1]  has  3.057  and  4.005  folds,  URE1_HELFE  [3.5.1]  has  4.005 
and  2.056  folds,  while  XYNA_RUMFL  [3.2.1]  has  2.018  and  3.001  folds.) 

(2)  The  functions  associated  with  similar  structures  often  turn  out  to  be  analogous,  even  if  they 
show  significant  difference  in  their  EC  numbers.  For  example,  Acetyl-CoA  carboxylase  and 
Methylmalonyl-CoA  carboxyltransferase  enzymes  are  both  actually  part  of  enzyme  complexes  in 
which  they  perform  the  same  function,  acting  as  enzyme  carriers.  This  similarity  is  not  reflected  in 
their EC classification numbers (6.4.1.2 and 2.1.3.1, respectively).

(3)  More  generally,  there  are  clearly  some  drawbacks  to  the  EC  system.  The  EC  system  is  a 
classification  of  reactions,  not  underlying  biochemical  mechanisms.  An  enzyme  classification 
system based explicitly on reaction mechanism (e.g. "involves pyridoxal phosphate" or "involves
Ser as a nucleophile") might also prove interesting to compare with protein structure. Alternatively,
one  based  on  pathways  might  be  worthwhile  since,  as  pointed  out  by  Martin  et  al.  (1998),  “it  may 
be  that  more  significant  relationships  occur  within  pathways,  where  the  substrate  is  successively 
transferred  from  enzyme  to  enzyme  along  the  pathway,  requiring  similar  binding  sites  at  each 
stage”. 

(4)  In  all  of  Swissprot  the  majority  of  the  101  folds  with  only  non-enzymatic  functions  probably 
have  several  functions,  but  we  were  not  able  to  consider  them  separately  here,  lacking  a  general 
protein  function  classification  system  for  non-enzymes.  Such  a  system  is  not  easy  to  derive.  For 
instance,  if  we  took  only  the  first  three  words  of  all  the  description  lines  in  Swissprot,  we  would 
end  up  with  about  10000  different  protein  functions  (besides  enzymes).  An  approximate  solution  to 
this  problem  is  offered  by  a  recent  work  that  has  classified  81%  of  Swissprot  into  one  of  three  broad 
categories  in  an  automated  fashion  (Tamames  et  al.,  1997).  However,  one  way  we  did  tackle  this 
problem  was  by  focussing  on  the  yeast  genome  for  which  there  are  a  number  of  overall  functional 
classification  systems.  This  work  showed  that  the  preferred  association  of  folds  with  certain 
functions  occurs  for  non-enzymes  as  well  as  enzymes.  Furthermore,  the  results  for  the  highly 
conserved  COGs  would  be  expected  to  be  exactly  the  same  in  other  genomes. 

Biases 

Our  results  are  undoubtedly  affected  to  some  degree  by  the  biases  inherent  in  the  databanks,  e.g. 
towards  mammalian,  medically  relevant  proteins  and  towards  proteins  that  easily  crystallize.  Such 
biases  probably  result  in  the  higher  representation  of  enzymes  in  the  structural  databases  —  in  the 
PDB  and  therefore  in  SCOP.  This  might  be  the  cause  of  the  higher  occurrence  of  alpha/beta 
proteins  in  our  tables  and  the  higher  density  of  matches  in  this  class. 

One  interesting  question  related  to  biases  is  whether  looking  only  at  individual  genomes  instead  of 
the  whole  database  will  give  different  results.  Our  results  for  yeast  suggest  that  it  is  not  necessarily 
the  case. 

Comparison  with  Martin  et  al.  (1998) 

Martin  et  al.  (1998)  performed  a  similar  analysis  to  the  one  here.  One  of  the  conclusions  of  their 
careful  study  was  that  there  was  no  relationship  between  the  top-level  CATH  classification  and  the 
top-level  EC  class.  This  seems  to  be  at  odds  with  our  results.  However,  we  have  found  the 
conclusions  to  be  consistent.  There  are  a  number  of  reasons  for  this: 

(1)  Martin  et  al.  tabulate  statistics  on  only  the  proteins  in  the  PDB.  They  found  a  clear  alpha/beta 
preference  for  proteins  in  the  oxidoreductase,  transferase,  and  hydrolase  categories  (EC  1-3), 
but  for  the  lyase,  isomerase,  and  ligase  categories  (EC  4-6)  they  observe  different  tendencies. 
However,  they  did  not  have  sufficient  counts  to  establish  statistical  significance  for  this  latter 
finding.  (This  is  basically  what  we  observe  in  Figure  4B.)  Because  in  our  analysis  we  use  all  of 
Swissprot  and  we  tabulate  our  statistics  a  little  differently  (in  terms  of  combinations),  we  get 
more “counts” than Martin et al. Thus, we are able to argue that the different distribution of
fold-function combinations observed for lyases, isomerases, and ligases is significant. This is
borne out by the chi-square statistics at the end of Table 2.

(2) Martin et al.'s “no-relationship” conclusion applies only to comparisons between the different
enzyme classes. However, we find our largest differences when comparing non-enzymes to
enzymes and also comparing between the various types of non-enzymes.

(3)  The  CATH  classification  that  Martin  et  al.  use  has  only  three  classes  in  its  topmost  level.  In 
contrast,  SCOP  has  six  top  classes  (table  1).  While  this  larger  number  of  categories  does  tend 
to  degrade  our  statistics  somewhat,  it  also  highlights  some  differences  that  cannot  be  observed 
in  terms  of  the  CATH  classes  alone  -  e.g.  we  find  clear  differences  between  alpha+beta  and 
alpha/beta  proteins  and  also  between  small  proteins  and  all  others. 

Apparently  High  Occurrence  of  Convergent  Evolution 

Note  that  the  table  in  Figure  2  is  not  square:  it  has  more  folds  than  functions.  This  shape  leads  to  a 
number  of  interesting  conclusions.  The  331  fold-function  combinations  we  observe  for  229  folds 
and  92  functions  imply  that  there  are  1.2  functions  per  fold  and  3.6  folds  per  function.  However, 
these  numbers  are  somewhat  skewed  by  the  large  number  of  folds  (101)  associated  only  with  the 
single  non-enzymatic  function.  If  we  exclude  these,  we  get  128  “enzyme-related”  folds,  which  are, 
in  turn,  associated  with  230  (=331-101)  different  fold-function  combinations.  This  implies  that  for 
the  enzyme-related  folds  there  are  on  average  1.8  functions  per  fold  and  2.5  folds  per  function 
(230/128  and  230/92).  The  larger  number  of  folds  per  function  than  functions  per  fold  seems  to 
suggest  that  nature  tends  to  reinvent  an  enzymatic  function  (i.e.  convergent  evolution)  more  often 
than  modify  an  already  existing  one  (i.e.  functional  divergence). 
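
The averages quoted for the enzyme-related folds follow directly from the counts in the preceding paragraph. A minimal sketch of that arithmetic (the variable names are chosen here for illustration):

```python
# Counts taken from the text above
combinations_all = 331        # observed fold-function combinations
folds_all = 229               # folds in Figure 2
functions_all = 92            # functions in Figure 2
nonenzyme_only_folds = 101    # folds associated only with the single non-enzymatic function

# Excluding the non-enzyme-only folds removes one combination per fold
enzyme_folds = folds_all - nonenzyme_only_folds                  # 128
enzyme_combinations = combinations_all - nonenzyme_only_folds    # 230

functions_per_fold = enzyme_combinations / enzyme_folds
folds_per_function = enzyme_combinations / functions_all
print(f"{functions_per_fold:.1f} functions per fold, {folds_per_function:.1f} folds per function")
```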

How  can  we  explain  this?  First,  1.8  is  a  lower  estimation  for  the  number  of  functions  per  fold  as  the 
non-enzymatic  functions  were  bundled  into  one  group  here.  Second,  there  are  several  examples  of 
functional  divergence  for  a  fold  within  one  3-component  enzyme  category  that  are  not  reflected  in 
our  tables.  For  instance,  the  1.1.1  category  has  248  different  enzymes,  which  all  share  the  same 
fold.  Third,  the  results  in  this  paper  were  derived  from  databases  comprised  of  data  from  several 
organisms.  It  is  quite  possible  that  within  one  organism,  functional  divergence  is  more  prevalent 
than  convergent  evolution. 

Superfolds  and  Superfunctions 

Are  functions  more  diverse  for  the  more  common  folds?  To  some  degree  this  brings  up  a  "chicken- 
and-the-egg"  issue.  Do  folds  have  more  functions  because  they  occur  more  often  or  is  it  the  other 
way  around?  The  commonness  of  a  fold  is  often  quantified  by  the  number  of  non-homologous 
sequence  families  accommodated  by  the  fold,  and  folds  accommodating  many  families  of  diverse 
sequences  have  been  dubbed  “superfolds”  (Orengo  et  al.,  1993).  We  find  that  there  seems  to  be  a 
loose  connection  between  the  number  of  diverse  sequence  families  associated  with  a  particular  fold 
(in  SCOP)  and  the  functional  diversity  of  that  fold.  For  instance,  the  top  superfold  is  the  TIM- 
barrel;  it  also  has  the  most  functions  associated  with  it  (15  different  enzymatic  functions  as  shown 
in  Figure  4).  On  the  other  hand,  there  are  exceptions:  the  alpha/beta  hydrolases  and  the  Rossmann 
fold  are  both  associated  with  22  sequence  families  in  SCOP,  but  while  the  former  has  eight 
different  enzymatic  functions,  the  latter  has  only  three. 

Finally, while there is a high incidence of particular functions with many folds (“superfunctions”),
as well as folds with many functions, the distribution of superfunctions appears to be more uniform
and less concentrated on a few exceptionally versatile individuals than is the case for folds. That is,
comparing  Figures  3  and  4  one  can  see  that  the  top  9  most  versatile  functions  are  associated  with  5 
to  7  folds  while  the  top  9  most  versatile  folds  carry  out  from  6  to  as  many  as  16  functions.  This  last 
value  is  for  the  TIM-barrel  and  underscores  the  uniqueness  of  this  fold  as  a  generic  scaffold  (see 
Figure  1  for  an  illustration  of  this  fold). 

Why Folds Are Associated with Functions: Chemistry vs. History

Why is a certain fold chosen to carry out a particular function? It is, of course, not possible to
answer  this  question  definitively  at  present.  However,  there  are  two  broad  themes  that  emerge  from 
our  analysis.  The  first  is  favorable  chemistry.  Perhaps  the  TIM-barrel  design  simply  provides  a 
"more  efficient"  scaffold  for  enzyme  reactions  so  that  is  why  it  is  so  prevalent.  Another  factor  is 
history.  Perhaps  the  association  between  a  particular  fold  and  its  function  reflects  a  particular 
"accident"  that  took  place  at  the  beginning  of  cellular  evolution.  However,  once  this  choice  was 
made  it  was  impossible  to  undo  even  if  other  folds  would  be  more  chemically  suitable.  This  could 
be  the  situation  for  the  ribosomal  proteins  (and  is  borne  out  by  the  results  of  figure  4D). 

MATERIALS  AND  METHODS 

Sequence  Matching  to  Swissprot 

All  the  protein  sequences  in  Swissprot  35  were  compared  with  all  the  protein  domain  sequences  in 
SCOP 1.35 by standard database search programs (WU-BLAST) (Altschul et al., 1990). The
following  five  criteria  were  used  in  the  searches: 

(1)  At  least  three  of  the  four  components  of  the  EC  number  are  assigned  in  the  DE  line  of  the 
Swissprot  entries. 

(2)  Fragments  in  Swissprot  were  excluded  (this  affected  about  10%  of  the  entries). 

(3)  For  WU-BLAST  searches  an  e-value  threshold  of  .0001  was  used,  unless  stated  otherwise. 

(4)  Only  ‘monoenzymes,’  i.e.  proteins  with  only  one  enzymatic  function,  were  considered.  This 
excluded  less  than  0.5%  of  the  Swissprot  enzymes. 

(5)  Only  ‘single-domain’  matches  with  Swissprot  proteins  were  taken  into  consideration.  This 
means  those  proteins  that  had  a  match  with  a  SCOP  domain  covering  most  of  the  Swissprot 
protein.  Specifically,  we  required  that  less  than  100  amino  acids  be  left  uncovered  in  the 
Swissprot  entry  by  a  match.  We  are  aware  that  this  is  only  an  approximation,  as  there  are 
domains  with  less  than  100  amino  acids;  however  it  is  considerably  less  than  the  average  length 
of  a  SCOP  domain  (163  residues)  and  seems  to  be  a  reasonable  threshold  in  an  automated 
approach. 
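
The five criteria above can be summarized as a single filter. The sketch below is an assumption-laden illustration: the Match record, its field names, and the choice to treat the threshold as inclusive are invented here and do not describe the actual search scripts.

```python
from dataclasses import dataclass

@dataclass
class Match:
    """One Swissprot-vs-SCOP hit; all field names are hypothetical."""
    ec_numbers: tuple        # EC numbers found in the DE line, e.g. ("3.2.1.4",)
    is_fragment: bool        # True if the Swissprot entry is a fragment
    e_value: float           # e-value of the match
    swissprot_length: int    # residues in the Swissprot entry
    covered_residues: int    # residues of the entry covered by the SCOP domain match

def passes_criteria(m, e_threshold=1e-4, max_uncovered=100):
    if len(m.ec_numbers) != 1:                       # (4) monoenzymes only
        return False
    assigned = [c for c in m.ec_numbers[0].split(".") if c.isdigit()]
    if len(assigned) < 3:                            # (1) at least three EC components
        return False
    if m.is_fragment:                                # (2) fragments excluded
        return False
    if m.e_value > e_threshold:                      # (3) e-value threshold
        return False
    uncovered = m.swissprot_length - m.covered_residues
    return uncovered < max_uncovered                 # (5) single-domain match

good = Match(("3.2.1.4",), False, 1e-20, 300, 250)
print(passes_criteria(good))
```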

All  the  searches  were  repeated  using  FASTA  with  an  e-value  threshold  of  .01  (Pearson,  1998; 
Pearson  &  Lipman,  1988).  The  results  obtained  by  the  two  different  comparison  programs  were  in 
agreement  with  each  other.  That  is,  the  FASTA  searches  did  not  result  in  any  new  combinations  of 
folds and enzymatic functions (a new dot in Figure 1), and therefore are not shown.

Sequence Matching to the Yeast Genome

To  get  as  great  a  coverage  of  the  yeast  genome  as  possible,  we  did  a  sequence  comparison  for  just 
figure  4  using  an  altered  protocol.  We  first  ran  the  PDB  against  the  yeast  genome  using  FASTA 
and  kept  all  matches  with  a  better  than  0.01  E-value  (Pearson,  1998;  Pearson  &  Lipman,  1988). 
Then,  to  increase  our  number  of  matches  further  we  used  the  PSI-blast  program  (Altschul  et  al., 
1997).  This  program  is  somewhat  more  complex  to  run  than  FASTA,  involving  embedding  the 
yeast  genome  in  NRDB  and  running  PDB  query  sequences  against  it  in  an  iterative  fashion,  adding 
the  matches  found  at  each  round  to  a  growing  profile.  We  used  the  PSI-blast  parameters  adapted 
from Teichmann et al. (1998): an e-value threshold of .0005 to include matches in the profile and
iteration of up to 30 times or to convergence. We did not continuously parse the output and
accepted matches at the final iteration that had E-value scores better than .0001. The number of
iterations to convergence varies depending on the PDB domains being run. Runs that take many
iterations, such as those for the immunoglobulin superfamily, take quite a long time (up to ½ hour on
a DEC 500 MHz workstation) and create large output files. In total, PSI-blast finds many more
matches  than  either  FASTA  or  WU-BLAST.  However,  it  has  problems  with  certain  small  and 
compositionally  biased  proteins.  We  used  FASTA  for  these  and  also  tried  to  remove  compositional 
bias  through  running  the  SEG  program  with  standard  parameters  (Wootton  &  Federhen,  1996). 

How  the  Structural  Classifications  were  Used:  SCOP  and  CATH 

SCOP  hierarchically  clusters  all  the  domains  in  the  PDB  database,  assigning  a  5-component  number 
to  each  domain  (Murzin  et  al.,  1995).  The  first  component  in  the  SCOP  numbers  denotes  the 
structural  class  to  which  the  domain  in  question  belongs.  The  second  component  of  the  SCOP 
numbers  designates  the  'fold'  type  of  the  domain.  There  are  altogether  361  different  fold  types  in 
SCOP 1.35. The 6 SCOP classes used in this survey are listed in Table 1B.
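
As an illustration of how the first two components are used, the following sketch splits a SCOP number of the form used in Table 3 (e.g. 3.001.001) into its class and fold. The function name and the zero-padded fold label are assumptions made here; the class abbreviations are those of Table 1B.

```python
# Class abbreviations as listed in Table 1B
SCOP_CLASSES = {1: "A", 2: "B", 3: "A/B", 4: "A+B", 5: "MULTI", 6: "TM", 7: "SML"}

def scop_class_and_fold(scop_number):
    """Return (class abbreviation, 'class.fold' label) for a dotted SCOP number."""
    parts = scop_number.split(".")
    cls, fold = int(parts[0]), int(parts[1])
    return SCOP_CLASSES[cls], f"{cls}.{fold:03d}"

print(scop_class_and_fold("3.001.001"))
print(scop_class_and_fold("2.018.001"))
```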

In this study a 95% non-redundant subset of SCOP was used, i.e. all pairs of domains had less than
95% sequence identity. This set is denoted pdb95d and is available from the SCOP website
(scop.mrc-lmb.cam.ac.uk).  We  used  version  1.35,  which  had  2314  protein  domains.  (The  yeast 
analysis  used  a  more  recent  version  of  SCOP,  1.38,  which  had  3206  domains.) 

The  CATH  classification  classifies  structures  in  analogous  fashion  to  SCOP  (Orengo  et  al.,  1997). 
However,  the  exact  structure  of  the  classification  is  not  the  same,  with  an  additional  architecture 
level  inserted  between  the  top-level  class  and  the  fold-level.  In  our  use  of  the  classification,  we 
created  a  limited  mapping  table  that  associated  each  SCOP  domain  in  pdb95d  with  its 
corresponding  classification  in  CATH  1.4.  This  was  not  always  possible  to  do  unambiguously.  As  a 
result,  we  left  out  the  ambiguous  matches  from  the  statistics. 

How  the  Functional  Classifications  were  Used:  ENZYME,  COGS,  and  MIPS 

The  EC  numbers  of  enzymes  are  composed  of  four  components  (Barrett,  1997):  (i)  The  first 
component  shows  to  which  of  the  six  main  divisions  the  enzyme  belongs;  (ii)  the  second  figure 
indicates  the  subclass  (referring  to  the  donor  in  oxidoreductases  or  the  group  transferred  in 
transferases,  or  the  affected  bond  in  hydrolases,  lyases  or  ligases);  (iii)  the  third  figure  indicates  the 
sub-subclass  (e.g.  indicating  the  type  of  acceptor  in  oxidoreductases)  and  (iv)  the  fourth  figure  gives 
the  serial  number  of  the  enzyme  in  its  sub-subclass.  The  six  main  divisions  are  listed  in  Table  1A. 
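
Throughout the analysis functions are compared at the level of the first three EC components. A minimal sketch of that reduction, with a hypothetical helper name and the abbreviations of Table 1A:

```python
# Main divisions and abbreviations as listed in Table 1A (0 denotes non-enzymes)
EC_DIVISIONS = {0: "NONENZ", 1: "OX", 2: "TRAN", 3: "HYD", 4: "LY", 5: "ISO", 6: "LIG"}

def ec_category(ec_number):
    """Return (division abbreviation, three-component EC category)."""
    parts = ec_number.split(".")
    return EC_DIVISIONS[int(parts[0])], ".".join(parts[:3])

print(ec_category("3.2.1.14"))   # endochitinase
print(ec_category("5.4.99.5"))   # chorismate mutase
```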


In  the  analysis  of  all  of  Swissprot,  when  we  counted  the  number  of  non-enzymatic  matches,  all  the 
proteins  called  ‘HYPOTHETICAL’  and  all  the  proteins  having  an  ‘-ase’  word  ending  but  lacking 
an  EC  number  in  their  description  were  excluded,  because  of  their  functional  ambiguity.  For 
relating  the  sequence  matches  of  the  yeast  genome  to  the  EC  system,  we  used  essentially  the  same 
criteria  as  we  did  for  all  of  Swissprot  (see  above):  single-domain,  mono-enzyme  matches  with  at 
least  a  3-component  EC  number. 

The COGs and especially the MIPS classifications are a bit more complex than the EC system in
that they include non-enzymes as well as enzymes (Tatusov et al., 1997; Koonin et al., 1998;
Mewes et al., 1997). They often associate multiple functions or roles to a given yeast ORF. This
happens for more than a third of the yeast ORFs with MIPS. In this case, if we could clearly show a
PDB match was associated with a single functional domain we made only that pairing. Otherwise
we associated all the functions assigned to a given PDB match to its respective fold.

Availability  of  Results  over  the  Internet 

A  number  of  detailed  tables  relevant  to  this  paper  will  be  made  available  over  the  Internet  at 
http://bioinfo.mbb.yale.edu/genome/foldfunc  —  in  particular,  a  “clickable”  version  of  Figure  1  and 
large  data  files  giving  all  the  fold  assignment  and  fold-function  combinations  for  Swissprot  and 
yeast.

Tables


Table  1,  Broad  Structural  and  Functional  Categories 

A.  Functional  categories  in  Swissprot  35 


EC Category   Category Name     Abbreviation   Num. of Functions in Category
0.0.0         Non-enzymes       NONENZ           1
1.*.*         Oxidoreductases   OX              86
2.*.*         Transferases      TRAN            28
3.*.*         Hydrolases        HYD             53
4.*.*         Lyases            LY              15
5.*.*         Isomerases        ISO             16
6.*.*         Ligases           LIG              9
Total:                                         208


List  of  the  functional  (enzymatic)  categories  in  Swissprot  and  the  abbreviations  used  throughout  the 
paper.  The  values  denote  the  number  of  3-component  EC-numbers  in  each  category. 

B.  Structural  classes  in  SCOP  1.35 


Fold Class   Class Name        Abbreviation   Num. of Folds in Class
1            All-alpha         A               81
2            All-beta          B               57
3            Alpha and beta    A/B             70
4            Alpha plus beta   A+B             91
5            Multidomain       MULTI           19
6            Transmembrane     TM               9
7            Small proteins    SML             43
Total:                                        361


List  of  the  structural  classes  in  SCOP  studied  in  this  paper  and  the  abbreviations  used  for  the 
classes.  Values  denote  the  number  of  folds  in  each  class  in  SCOP  1.35.  Class  6  is  not  used  in  the 
analysis here.

Table 2, Statistics over 42 structure-function classes


This  table  shows  various  totals  from  Figure  2  distributed  among  the  42  structure-function  classes  — 
i.e. the seven functional categories in Table 1A multiplied by the six structural categories in Table
1B. Part A shows how many potential fold-function combinations there are in Figure 2 amongst
each  of  the  42  classes.  Part  B  shows  how  many  of  these  21068  possible  combinations  are  actually 
observed.  Part  C  shows  the  total  number  of  different  folds  (i.e.  selected  columns  in  figure  1)  in  each 
class.  Part  D  shows  the  total  number  of  different  functions  (i.e.  selected  rows  in  Figure  2)  in  each 
class.  Part  E  shows  the  total  number  of  matching  Swissprot  proteins  in  the  42  classes.  Note  that  to 
observe  a  fold-function  combination  one  only  needs  the  existence  of  a  single  match  between  a 
Swissprot  protein  and  a  SCOP  domain.  However,  there  can  be  many  more.  That  is  why  the  totals  in 
this  table  sum  up  to  so  much  larger  an  amount  than  331. 

Here  is  an  example  of  how  to  read  parts  A  to  E  of  the  table,  focussing  on  the  all-alpha, 
oxidoreductase region. Part A shows that there are 1104 cells, filled or unfilled, in this region,
corresponding  to  possible  combinations.  Part  B  shows  that  13  of  these  1104  cells  are  filled, 
corresponding  to  observed  all-alpha,  oxidoreductase  combinations.  Part  C  shows  that  there  are  7 
folds,  corresponding  to  columns  with  filled  cells  in  this  region.  Part  D  shows  that  there  are  8 
functions,  corresponding  to  rows  with  filled  cells  in  this  region.  Finally,  in  Part  E  we  find  that  there 
are  150  Swissprot  entries  that  have  matches  with  a  SCOP  domain.  They  correspond  to  the  13 
observed  combinations  in  Part  B. 
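
The cell counts in Part A are simply products of the number of folds in a structural class and the number of functions in a functional category. A short sketch of that bookkeeping, assuming the fold counts are read from the NONENZ row of Part A (which has a single function, so each cell count equals a fold count):

```python
# Fold counts per structural class, read from the NONENZ row of Part A (assumption stated above)
folds_per_class = {"A": 46, "B": 36, "A/B": 48, "A+B": 56, "MULTI": 15, "SML": 28}

total_folds = sum(folds_per_class.values())      # folds in Figure 2
total_functions = 92                             # functions in Figure 2
total_cells = total_folds * total_functions      # all possible combinations

# The all-alpha, oxidoreductase region has 1104 cells, which fixes the
# number of oxidoreductase functions represented in Figure 2.
ox_functions = 1104 // folds_per_class["A"]
print(total_folds, total_cells, ox_functions)
```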

Parts F and G give information on the statistical significance of the differences observed between
the 42 structure-function classes. Part G gives the significance that the observed distribution of
fold-function combinations in a given functional class is different than average (i.e. it tests the null
hypothesis that the distribution of fold-function combinations is the same in each functional class).
This is very similar to the derivation in Martin et al. (1998). A chi-squared statistic is computed for
each of the 7 functional classes in the conventional way: $\chi^2(f) = \sum_s (O_{sf} - E_{sf})^2 / E_{sf}$,
where for a given functional class f and structure class s, $O_{sf}$ is the observed number of
fold-function combinations and $E_{sf}$ is the expected number. $E_{sf}$ is simply computed from scaling
the "sum" column and row in Part B of the table: $E_{sf} = T_s T_f / T$, where $T_s$ is the total number of
combinations in a given structural class s (sum row), $T_f$ is the total number of combinations in a
given functional class f (sum column), and T is the total observed number of combinations, 331.
Part F gives the statistical significance that the observed distribution of fold-function combinations
in a given structural class is different than average. To compute this one simply sums over functions
instead of structures: $\chi^2(s) = \sum_f (O_{sf} - E_{sf})^2 / E_{sf}$. After each chi-squared statistic is
reported, a rough probability or P-value is given. This gives the chance the observed distribution
could be obtained randomly.
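
The chi-squared computation described above can be sketched in a few lines. The counts are read from Part B of this table under two assumptions made here: empty cells are taken as zero, and the all-beta non-enzyme cell is taken as 30 so that the row and column sums agree with the printed totals. The function and variable names are illustrative only.

```python
STRUCT_CLASSES = ["A", "B", "A/B", "A+B", "MULTI", "SML"]

# Observed fold-function combinations (Part B), one row per functional class.
# Zeros stand for empty cells; the value 30 is inferred from the row/column sums.
OBSERVED = {
    "NONENZ": [34, 30, 14, 28, 4, 26],
    "OX":     [13, 5, 17, 3, 4, 5],
    "TRAN":   [3, 3, 16, 8, 5, 0],
    "HYD":    [4, 11, 30, 18, 4, 0],
    "LY":     [2, 3, 13, 5, 0, 0],
    "ISO":    [1, 2, 7, 4, 2, 0],
    "LIG":    [0, 1, 2, 3, 1, 0],
}

def chi_squared(observed):
    n_struct = len(STRUCT_CLASSES)
    t_f = {f: sum(row) for f, row in observed.items()}                          # sum column
    t_s = [sum(row[s] for row in observed.values()) for s in range(n_struct)]   # sum row
    total = sum(t_f.values())

    def term(f, s):
        expected = t_s[s] * t_f[f] / total
        return (observed[f][s] - expected) ** 2 / expected

    by_function = {f: sum(term(f, s) for s in range(n_struct)) for f in observed}
    by_structure = {STRUCT_CLASSES[s]: sum(term(f, s) for f in observed)
                    for s in range(n_struct)}
    return by_function, by_structure, total

by_function, by_structure, total = chi_squared(OBSERVED)
print(round(by_function["NONENZ"], 1), round(by_structure["A"], 1))
```

Under these assumptions the sketch gives about 40.7 for the non-enzymes and 17.5 for the all-alpha class, in line with the values reported in the table.
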


A. Number of possible combinations between folds and functions in each of 42 classes
(number of cells in Figure 2)

            A      B    A/B    A+B  MULTI    SML    sum
NONENZ     46     36     48     56     15     28    229
OX       1104    864   1152   1344    360    672   5496
TRAN      598    468    624    728    195    364   2977
HYD      1334   1044   1392   1624    435    812   6641
LY        414    324    432    504    135    252   2061
ISO       460    360    480    560    150    280   2290
LIG       276    216    288    336     90    168   1374
sum      4232   3312   4416   5152   1380   2576  21068

B. Number of observed combinations between folds and functions in each of 42 classes
(number of filled cells in Figure 2)

            A      B    A/B    A+B  MULTI    SML    sum
NONENZ     34     30     14     28      4     26    136
OX         13      5     17      3      4      5     47
TRAN        3      3     16      8      5      -     35
HYD         4     11     30     18      4      -     67
LY          2      3     13      5      -      -     23
ISO         1      2      7      4      2      -     16
LIG         -      1      2      3      1      -      7
sum        57     55     99     69     20     31    331


C.  Number  of  folds  in  each  of  the  42  classes  (columns  with  a  filled  cell  in  Figure  2) 


            A      B    A/B    A+B  MULTI    SML    sum
NONENZ     34     30     14     28      4     26    136
OX          7      5      9      3      3      3     30
TRAN        3      2     15      6      5      -     31
HYD         4      8     19     18      3      -     52
LY          2      3      8      5      -      -     18
ISO         1      2      7      4      2      -     16
LIG         -      1      1      3      1      -      6
sum        51     51     73     67     18     29    289

D.  Number  of  functions  in  each  of  the  42  classes  (rows  with  a  filled  cell  in  Figure  2) 


            A      B    A/B    A+B  MULTI    SML    sum
NONENZ      1      1      1      1      1      1      6
OX          8      5      9      3      3      5     33
TRAN        2      3     13      8      4      -     30
HYD         4      7     19     14      4      -     48
LY          2      2      7      3      -      -     14
ISO         1      2      5      4      1      -     13
LIG         -      1      2      2      1      -      6
sum        18     21     56     35     14      6    150

E.  Total  number  of  matching  Swissprot  sequences  in  each  of  the  42  fold-function  classes 


            A      B    A/B    A+B  MULTI    SML    sum
NONENZ   1940   1159    560    638    106    892   5295
OX        150    202    388     50     68     18    876
TRAN       65     14    363    116    174      -    732
HYD       116    394    295    452     92      -   1349
LY         40     47    168    104      -      -    359
ISO         2     54    122     22      2      -    202
LIG         -      5     26     69     24      -    124
sum      2313   1875   1922   1451    466    910   8937


F.  How  much  does  each  of  the  fold  classes  deviate  from  the  average  distribution  of 
functions? 


Class    chi-squared   P
A        17.5          <0.01
B         5.2          <0.6
A/B      32.5          <0.00002
A+B       7.7          <0.3
MULTI     9.9          <0.2
SML      27.8          <0.0002

G.  How  much  do  each  of  the  function  classes  deviate  from  the  average  distribution  of 
folds? 


Class    chi-squared   P
NONENZ   40.7          <0.0000002
OX        9.9          <0.08
TRAN     13.1          <0.03
HYD      17.3          <0.005
LY       10.2          <0.08
ISO       5.0          <0.5
LIG       4.3          <0.6

Table 3, Specific Convergences


Explicit enzymatic functions associated with different folds. Of the 13 different enzyme functions
listed, eight are hydrolases, five of which belong to the 3.2.1 EC category. One of them,
beta-glucanase, is associated with three different folds. Note that most of the enzymes in the table are
associated with folds from different classes. Even when the folds are from the same class, as in the
case of protein-tyrosine phosphatases, they are clearly different. Fold numbers are from SCOP 1.35.
Domain identifiers are according to the SCOP syntax: d1pdbcN, where “1pdb” is a PDB id, “c” is a
chain identifier, and “N” describes if this is the first, second, or only domain in the chain. Thus,
d1ggta1 is the first domain in the A chain of 1GGT.
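
The identifier syntax can be unpacked mechanically. The sketch below is illustrative only: the function name is hypothetical, and treating an underscore as "no chain identifier" or "only domain in the chain" is an assumption made here about the padding character.

```python
def parse_scop_domain(domain_id):
    """Split a 7-character SCOP domain identifier into (PDB id, chain, domain)."""
    if len(domain_id) != 7 or domain_id[0] != "d":
        raise ValueError(f"unexpected domain identifier: {domain_id!r}")
    pdb_id = domain_id[1:5].upper()
    # Assumption: "_" pads a missing chain identifier or marks the only domain
    chain = None if domain_id[5] == "_" else domain_id[5].upper()
    domain = None if domain_id[6] == "_" else domain_id[6]
    return pdb_id, chain, domain

print(parse_scop_domain("d1ggta1"))
print(parse_scop_domain("d2cba__"))
```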


EC#        Enzymatic function            Fold #1    Dom #1   Swissprot 1  Fold #2    Dom #2   Swissprot 2
1.11.1.10  CHLOROPEROXIDASE              3.048.001  d1broa_  PRXC_PSEPY   1.068.001  d1vnc__  PRXC_CURIN
1.15.1.1   SUPEROXIDE DISMUTASE          2.001.007  d1srda_  SOD1_ORYSA   4.023.001  d1mnga2  SODM_BACCA
3.1.3.48   PROTEIN-TYROSINE PHOSPHATASE  3.028.001  d1phr__  PTPA_STRCO   3.029.001  d2hnp__  PYP3_SCHPO
3.1.26.4   RIBONUCLEASE H                3.038.003  d2rn2__  RNH_ECOLI    3.039.001  d1tfr__  RNH_BPT4
3.2.1.4    ENDOGLUCANASE                 1.061.001  d1cem__  GUN_BACSP    3.001.001  d1ecea_  GUN_BACPO
3.2.1.8    XYLANASE                      2.018.001  d1yna__  XYN_TRIHA    3.001.001  d2exo__  XYNB_THENE
3.2.1.14   ENDOCHITINASE                 3.001.001  d1hvq__  CHIA_TOBAC   4.002.001  d2baa__  CHIX_PEA
3.2.1.73   BETA-GLUCANASE*               3.001.001  d1ghr__  GUB_NICPL    2.018.001  d1gbg__  GUB_BACSU
3.2.1.91   EXOGLUCANASE                  2.018.001  d1cela_  GUX1_TRIVI   3.002.001  d1cb2a_  GUX3_AGABI
3.5.2.6    BETA-LACTAMASE                5.003.001  d1btl__  BLP4_PSEAE   4.083.001  d1bmc__  BLAB_BACCE
4.2.1.1    CARBONIC ANHYDRASE            2.053.001  d1thja_  CAH_METTE    2.047.001  d2cba__  CAHZ_BRARE
5.2.1.8    CIS-TRANS ISOMERASE           4.018.001  d1fkd__  MIP_TRYCR    2.041.001  d2cpl__  CYPR_DROME
5.4.99.5   CHORISMATE MUTASE             1.079.001  d1csma_  CHMU_YEAST   4.037.001  d2chsa_  CHMU_BACSU

Table 4, Specific Divergences

List  of  SCOP  domains  that  are  each  homologous  to  several  Swissprot  proteins  with  significantly 
different function. Part A. Domains homologous to proteins with different (in the last three
components of EC numbers) enzymatic functions. In most cases, the enzymatic functions remain
analogous,  as  reflected  in  the  names  of  the  enzymes.  B.  Domains  homologous  to  proteins  with  both 
enzymatic  and  non-enzymatic  functions.  (See  Table  3  for  the  SCOP  domain  syntax.) 

A. Two different enzymatic functions

SCOP domain | Fold number | Swissprot 1 | EC num 1 | Function 1 | Swissprot 2 | EC num 2 | Function 2
----------------------------------------------------------------------------------------------
d2abk__ | 1.001.054.001.001.001 | END3_ECOLI | 4.2.99.18 | ENDONUCLEASE III | GTMR_METTF | 3.2.2.- | POSSIBLE G-T MISMATCHES REPAIR ENZYME
d1bdo__ | 1.002.055.001.001.001 | BCCP_ECOLI | 6.4.1.2 | BIOTIN CARBOXYL CARRIER PROTEIN OF ACETYL-COA CARBOXYLASE | BCCP_PROFR | 2.1.3.1 | BIOTIN CARBOXYL CARRIER PROTEIN OF METHYLMALONYL-COA CARBOXYLTRANSFERASE
d1dhpa_ | 1.003.001.003.001.004 | NPL_ECOLI | 4.1.3.3 | N-ACETYLNEURAMINATE LYASE SUBUNIT | DAPA_BACSU | 4.2.1.52 | DIHYDRODIPICOLINATE SYNTHASE
d1hdca_ | 1.003.018.001.002.005 | ENTA_ECOLI | 1.3.1.28 | 2,3 DIHYDRO-2,3 DIHYDROXYBENZOATE DEHYDROGENASE | ADH1_DROMO | 1.1.1.1 | ALCOHOL DEHYDROGENASE 1
d1nipa_ | 1.003.024.001.005.003 | BCHL_RHOCA | 1.3.1.33 | PROTOCHLOROPHYLLIDE REDUCTASE 33 KD SUBUNIT | NIFH_THIFE | 1.18.6.1 | NITROGENASE IRON PROTEIN
d1gara_ | 1.003.043.001.001.001 | PUR3_YEAST | 2.1.2.2 | PHOSPHORIBOSYLGLYCINAMIDE FORMYLTRANSFERASE | PURU_CORSP | 3.5.1.10 | FORMYLTETRAHYDROFOLATE DEFORMYLASE
d2dkb__ | 1.003.045.001.003.001 | OAT_RAT | 2.6.1.13 | ORNITHINE AMINOTRANSFERASE PRECURSOR | GSAB_BACSU | 5.4.3.8 | GLUTAMATE-1-SEMIALDEHYDE 2,1-AMINOMUTASE 2
d1ede__ | 1.003.048.001.003.001 | DMPD_PSEPU | 3.1.1.- | 2-HYDROXYMUCONIC SEMIALDEHYDE HYDROLASE | HALO_XANAU | 3.8.1.5 | HALOALKANE DEHALOGENASE
d1fua__ | 1.003.053.001.001.001 | ARAD_ECOLI | 5.1.3.4 | L-RIBULOSE-5-PHOSPHATE 4-EPIMERASE | FUCA_ECOLI | 4.1.2.17 | L-FUCULOSE PHOSPHATE ALDOLASE
d1lmn__ | 1.004.002.001.002.010 | LCA_RAT | 2.4.1.22 | ALPHA-LACTALBUMIN PRECURSOR | LYC1_PIG | 3.2.1.17 | LYSOZYME C-1
d1frva_ | 1.005.015.001.001.001 | FRHG_METVO | 1.12.99.1 | COENZYME F420 HYDROGENASE GAMMA SUBUNIT | MBHS_AZOCH | 1.18.99.1 | UPTAKE HYDROGENASE SMALL SUBUNIT PRECURSOR
Table 4 (continued)

B. Enzyme and Non-Enzyme

SCOP domain | Fold number | Swissprot 1 | Enzymatic function | EC number | Swissprot 2 | Non-enzymatic function
----------------------------------------------------------------------------------------------
d1gsq_1 | 1.001.034.001.001.007 | GTS2_MANSE | GLUTATHIONE S-TRANSFERASE 2 | 2.5.1.18 | SC11_OMMSL | S-CRYSTALLIN SL11 (MAJOR LENS POLYPEPTIDE)
d1lcl__ | 1.002.018.001.003.003 | LPPL_HUMAN | EOSINOPHIL LYSOPHOSPHOLIPASE | 3.1.1.5 | LEG7_RAT | GALECTIN-7
d1brbe_ | 1.002.029.001.002.003 | CFAD_RAT | ENDOGENOUS VASCULAR ELASTASE | 3.4.21.46 | CAP7_HUMAN | AZUROCIDIN (ANTIMICROBIAL, HEPARIN-BINDING PROTEIN)
d1mup__ .. | 1.002.039.001.001.007 | PGHD_HUMAN | PROSTAGLANDIN-D SYNTHASE | 5.3.99.2 | LACC_CANFA | BETA-LACTOGLOBULIN III
..d1mup__ | 1.002.039.001.001.007 | | | | QSP_CHICK | QUIESCENCE-SPECIFIC PROTEIN
d2hhma_ .. | 1.005.007.001.002.001 | MYOP_XENLA | INOSITOL MONOPHOSPHATASE | 3.1.3.25 | SUHB_ECOLI | EXTRAGENIC SUPPRESSOR PROTEIN SUHB
..d2hhma_ | 1.005.007.001.002.001 | STRO_STRGR | DTDP-GLUCOSE SYNTHASE | 2.7.7.24 | |
d1isua_ | 1.007.029.001.001.001 | IRO_THIFE | IRON OXIDASE PRECURSOR (FE(II) OXIDASE) | 1.16.3.- | HPIT_RHOTE | HIGH POTENTIAL IRON-SULFUR PROTEIN (HIPIP)
Figures


Figure  1,  Specific  Example  of  Convergent  and  Divergent  Evolution 

TOP shows an example of convergent evolution: the structures of two carbonic anhydrases with the
same enzymatic function (EC number 4.2.1.1) but different folds. Drawn with Molscript
(Kraulis, 1991) from 1THJ (left-handed beta helix) and 1DMX (flat beta sheet). BOTTOM shows
an example of possible divergent evolution, the TIM barrel. This fold functions as a generic
scaffold catalyzing 15 different enzymatic functions. A schematic figure of the TIM barrel fold is
shown with numbers in boxes indicating the different locations of the active site in four proteins
that have this fold. These four proteins (xylose isomerase, aldose reductase, enolase, and adenosine
deaminase) carry out very different enzymatic functions, in four of the main EC classes (1.*.*,
3.*.*, 4.*.*, and 5.*.*). They have active sites at very different locations in the barrel, yet they all
share the same fold.

See  figure  over... 
Figure 1 (continued)


Figure  2,  Overview 


Overview of all the single-domain matches between proteins in Swissprot 35 and domains in SCOP
1.35. Sequences were compared with BLAST using the match criteria described in the methods.
The matches are clustered into 92 functions (based on 3-component EC numbers), one per row, and
229 folds (based on SCOP fold numbers), one per column. The first row indicates the matches with
non-enzymes. There are, thus, 21068 (= 92 x 229) possible combinations shown in the figure. Only
331 are actually observed. These are indicated by filled-in black squares.
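The arithmetic behind this grid can be checked in a few lines. The sketch below is illustrative only: the three counts come from the caption, while the helper ec_category is a hypothetical name for the truncation of an EC number to its first three components.

```python
def ec_category(ec, depth=3):
    """Truncate an EC number such as '3.2.1.14' to its first `depth` components."""
    return ".".join(ec.split(".")[:depth])


n_functions = 92   # rows, from the caption
n_folds = 229      # columns, from the caption
observed = 331     # filled-in squares, from the caption

possible = n_functions * n_folds
print(possible)                       # 21068
print(f"{observed / possible:.2%}")   # 1.57%
print(ec_category("3.2.1.14"))        # 3.2.1
```

Under these numbers, fewer than 2% of the possible fold-function combinations are occupied.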

See  figure  over... 
Figure 3, Chart with Breakdown among Structure-Function Classes in 2
Genomes 

Charts and tables showing the number of folds in each fold class associated with only enzymatic
(ENZ), only non-enzymatic (nonENZ), and both enzymatic and non-enzymatic functions (Both).
The results are shown for all of Swissprot (part A), for just the yeast genome (part B), and for just
the E. coli genome (part C). The results for individual domains in a minimum set of SCOP domains
also support these tendencies (data not shown). The numbers in part B are not based on the
PSI-BLAST protocol used for Figure 4. Rather, they are found just as "subsets" of the overall
Swissprot results to make them readily comparable with the rest of the paper. Because of this, the
numbers in this figure will not match exactly those in Figure 4; the difference has to do with the
greater number of fold-function combinations found by PSI-BLAST as compared to WU-BLAST.

A.  All  of  Swissprot 
Figure 3 (continued)


B. Yeast: number of folds in the different functional categories

C. E. coli: number of folds in the different functional categories
Figure 4, Structure-function Classes in the Yeast Genome Analyzed
Through  a  Variety  of  Classification  Schemes 

This figure shows the distribution of fold-function combinations in the yeast genome as analyzed by
a variety of different structural and functional classifications. Each of the subfigures is a
cross-tabulation of one structural classification scheme (on the column heads) versus a functional
classification (row heads). Part A shows SCOP vs. ENZYME; Part B, CATH vs. ENZYME;
Part C, SCOP vs. COGs; Part D, SCOP vs. Most Conserved COGs; Part E, SCOP vs. MIPS
Functional Catalogue. Each of the grid boxes gives the number of fold-function combinations
within a structure-function class. This number is expressed as a percentage of the total number of
combinations in the diagram to make the graphs readily comparable. The total number of
combinations in each of the subfigures is 141 (A), 77 (B), 1207 (C), 120 (D), and 66 (E). Some
notes on the subfigures: Part A is directly comparable with the cross-tabulation in Table 2B for all
of Swissprot. In Parts C and D, we employ the COGs scheme in exactly the same fashion as we did
the ENZYME classification. We form combinations between individual yeast COGs and SCOP folds
(e.g. COG 0186 with fold 2.26) and then we place these combinations into larger structure-function
classes. The COGs overall functional classes are denoted by a single letter and are in turn
grouped into three broader areas (so, for instance, the 0186-2.26 pair would go into the
structure-function class all-beta, J). We proceed similarly for the MIPS yeast functional catalogue
(Part E). This catalogue gives each function a 2- or 3-component number similar to an EC number
(e.g. 07.20.3 or 06.2). We use the first two numbers to create combinations with SCOP folds and
then use the top number to create the functional classes shown in the diagram. For Part D we use
just the 110 COGs that are present in all 8 genomes in the current COGs analysis (E. coli,
H. influenzae, H. pylori, M. genitalium, M. pneumoniae, Synechocystis, M. jannaschii, yeast).
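The cross-tabulation procedure described in this caption can be sketched as follows. This is a minimal illustration, not the protocol actually used: the function name class_percentages and the toy input are hypothetical, and only the 0186-2.26 pair and its class (all-beta, J) come from the caption. The mapping from the leading fold-number component to a structural class follows the usage elsewhere in this paper.

```python
from collections import Counter

# Structural class keyed by the first component of a SCOP 1.35 fold number,
# as used in this paper (e.g. fold 2.26 is all-beta).
SCOP_CLASS = {"1": "all-alpha", "2": "all-beta", "3": "alpha/beta",
              "4": "alpha+beta", "5": "multi-domain", "7": "small"}


def class_percentages(pairs, functional_class):
    """Cross-tabulate fold-function combinations into structure-function classes.

    pairs: iterable of (function_id, fold) combinations, e.g. ('0186', '2.26').
    functional_class: dict mapping a function id to its broad class, e.g. 'J'.
    Returns {(structural_class, functional_class): percent of all combinations}.
    """
    unique = set(pairs)  # each fold-function combination is counted once
    counts = Counter(
        (SCOP_CLASS[fold.split(".")[0]], functional_class[func])
        for func, fold in unique
    )
    total = sum(counts.values())
    return {cell: 100.0 * n / total for cell, n in counts.items()}


# Toy input: only the 0186-2.26 pair is taken from the caption; the other
# identifiers and class letters are invented for illustration.
pairs = [("0186", "2.26"), ("0186", "2.26"), ("9001", "3.1"),
         ("9002", "3.1"), ("9003", "1.63")]
cog_class = {"0186": "J", "9001": "E", "9002": "E", "9003": "C"}
result = class_percentages(pairs, cog_class)
print(result[("all-beta", "J")])  # 25.0
```

Expressing each cell as a percentage of the total, rather than as a raw count, is what makes subfigures with very different totals comparable.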
Rough Layout of Subfigures to Figure 4

Figure 4 (continued), ENLARGEMENT of Parts A and B (SCOP vs. ENZYME; CATH vs. ENZYME)

Figure 4 (continued), ENLARGEMENT of Part C (All Yeast COGs)

Figure 4 (continued), ENLARGEMENT of Part D (Most Conserved COGs)

Figure 4 (continued), ENLARGEMENT of Part E (MIPS Functional Cat.)
Figure 5, The Most Versatile Folds

The  functions  associated  with  the  16  most  versatile  folds  are  shown.  Values  in  the  table  denote  the 
number  of  matches  between  a  particular  fold  type  in  pdb95d  (designated  by  its  fold  number  in 
SCOP  1.35)  and  an  enzyme  category  (represented  by  the  first  three  components  of  the  respective 
EC  numbers).  Here  and  in  the  following  tables  the  same  parameters  were  used  for  matching  as  in 
Figure  2.  The  numbers  in  the  top  row  indicate  the  number  of  functions  a  particular  fold  is 
associated  with.  The  identifiers  above  the  fold  numbers  are  either  PDB  or  SCOP  identifiers  of 
representative  structures  (the  latter  only  if  the  PDB  entry  contains  more  than  one  domain  or  chain). 
(See  the  caption  to  Table  3  for  the  syntax  of  SCOP  identifiers.)  The  first  row  in  the  table  with  the 
artificial  0.0.0  EC  number  shows  the  number  of  matches  with  non-enzymatic  functions.  Among  the 
two  all-alpha  folds  in  the  table,  Cytochrome  P450  (1.063)  is  exclusively  enzymatic,  associated  with 
five different enzyme functions, all related to Cytochrome P450. Only one alpha+beta fold,
Ferredoxin (4.031), is present in the table; it matches predominantly non-enzymatic
ferredoxins, but also enzymes in four different enzyme classes. In the multi-domain class,
Beta-Lactamase/D-ala carboxypeptidase (5.003) has the most matches with penicillinase (EC
number 3.5.2) and only one match with a non-enzyme, which also binds penicillin but has no
enzymatic activity (Coque et al., 1993). The class of small domains is represented by only one
fold, membrane-bound rubredoxin-like (7.035), which has matches only with enzymes. It is possible
that some proteins classified as "non-enzymes" may indeed be enzymes that are missing the
corresponding EC number. In this case, our analysis may be useful in pointing to which
non-enzymes may actually be enzymes.

See  figure  over... 
Top Multifunctional Folds


Figure  6,  The  Most  Versatile  Functions 

Values in the table denote the number of matches between a particular enzyme category (designated
by the first three components of its EC number) and a SCOP 1.35 fold (designated by its fold
number). This figure follows the same conventions as Figure 4. The rows are arranged in
decreasing  order  according  to  the  number  of  different  folds  with  which  they  are  associated 
(numbers  shown  in  the  first  column).  A  hash  (“#”)  in  any  cell  indicates  that  its  value  is  greater  than 
10. 
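The layout of this table can be sketched in Python as follows. This is an assumption-laden illustration rather than the code behind the figure: the function name versatile_functions is hypothetical, the EC-fold pairs are borrowed from Table 3, and the match counts are invented.

```python
from collections import defaultdict


def versatile_functions(matches):
    """Tabulate matches per (3-component EC category, fold).

    matches: iterable of (ec_number, fold) pairs, one per sequence match.
    Returns rows of (n_folds, ec_category, {fold: cell}), sorted by the
    number of distinct folds in decreasing order; cells above 10 show '#'.
    """
    table = defaultdict(lambda: defaultdict(int))
    for ec, fold in matches:
        category = ".".join(ec.split(".")[:3])  # first three EC components
        table[category][fold] += 1
    rows = []
    for category, folds in table.items():
        cells = {f: ("#" if n > 10 else str(n)) for f, n in folds.items()}
        rows.append((len(folds), category, cells))
    rows.sort(key=lambda r: (-r[0], r[1]))
    return rows


# Toy input: EC-fold pairs follow Table 3, but the counts are made up.
matches = ([("3.2.1.4", "1.061")] + [("3.2.1.14", "3.001")] * 11 +
           [("3.2.1.8", "2.018"), ("4.2.1.1", "2.053"),
            ("4.2.1.1", "2.047"), ("5.4.99.5", "1.079")])
rows = versatile_functions(matches)
for n_folds, category, cells in rows:
    print(n_folds, category, cells)
```

With this toy input, category 3.2.1 comes first because it is associated with three folds, and its 3.001 cell is shown as "#".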

See  figure  over... 


Figure  7,  Multi-functionality  versus  e-value  threshold 

The graph shows how the percentage of multifunctional enzymatic domains varies as a function of
the e-value threshold. A multifunctional domain occurs when a particular domain in SCOP matches
domains in Swissprot with different enzymatic functions. For these calculations, we had to use a
more minimal version of SCOP than the pdb95d dataset referred to in the methods to prevent double
matches (i.e. two SCOP domains matching a single Swissprot domain). The construction of this
minimal SCOP was described previously (Gerstein, 1998). Basically, all the domains in SCOP were
clustered via a multi-linkage approach into 990 representative domains, such that no two domains
matched each other with a FastA e-value better than 0.01.
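The quantity plotted here can be sketched as below. This is a minimal illustration under stated assumptions: functions are compared as full EC numbers, the denominator is the set of domains with at least one enzymatic match at the given threshold, and the function name percent_multifunctional and the e-values in the toy input are invented. None of these choices is confirmed by the text.

```python
def percent_multifunctional(matches, threshold):
    """Percentage of matched domains that hit more than one enzymatic function.

    matches: iterable of (scop_domain, ec_number, e_value) hits.
    A domain counts as multifunctional if its hits with e-value at or
    below `threshold` cover more than one distinct EC number.
    """
    functions = {}
    for domain, ec, e_value in matches:
        if e_value <= threshold:
            functions.setdefault(domain, set()).add(ec)
    if not functions:
        return 0.0
    multi = sum(1 for ecs in functions.values() if len(ecs) > 1)
    return 100.0 * multi / len(functions)


# Toy hits: domain and EC pairings follow Tables 3 and 4; e-values are made up.
hits = [("d1dhpa_", "4.1.3.3", 1e-30), ("d1dhpa_", "4.2.1.52", 1e-8),
        ("d1btl__", "3.5.2.6", 1e-40), ("d2cba__", "4.2.1.1", 1e-50)]
for threshold in (1e-20, 1e-5):
    print(threshold, percent_multifunctional(hits, threshold))
```

Loosening the threshold from 1e-20 to 1e-5 admits the weaker second hit for d1dhpa_, so that domain becomes multifunctional and the percentage rises.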


Relative number of domains with multiple functions,
as a function of e-value threshold